Abstract


We introduce Embed to Control (E2C), a method for model learning and control 
of non-linear dynamical systems from raw pixel images. E2C consists of a deep 
generative model, belonging to the family of variational autoencoders, that learns 
to generate image trajectories from a latent space in which the dynamics is constrained
to be locally linear. Our model is derived directly from an optimal control
formulation in latent space, supports long-term prediction of image sequences and 
exhibits strong performance on a variety of complex control problems. 

1 Introduction 

Control of non-linear dynamical systems with continuous state and action spaces is one of the key 
problems in robotics and, in a broader context, in reinforcement learning for autonomous agents. 
A prominent class of algorithms that aim to solve this problem are model-based locally optimal 
(stochastic) control algorithms such as iLQG control [?], which approximate the general
non-linear control problem via local linearization. When combined with receding horizon control [3] and
machine learning methods for learning approximate system models, such algorithms are powerful
tools for solving complicated control problems [?]; however, they either rely on a known system
model or require the design of relatively low-dimensional state representations. For real autonomous 
agents to succeed, we ultimately need algorithms that are capable of controlling complex dynamical 
systems from raw sensory input (e.g. images) only. In this paper we tackle this difficult problem. 

If stochastic optimal control (SOC) methods were applied directly to control from raw image data, 
they would face two major obstacles. First, sensory data is usually high-dimensional - i.e. images 
with thousands of pixels - rendering a naive SOC solution computationally infeasible. Second, 
the image content is typically a highly non-linear function of the system dynamics underlying the 
observations; thus model identification and control of this dynamics are non-trivial. 

While both problems could, in principle, be addressed by designing more advanced SOC algorithms,
we approach the “optimal control from raw images” problem differently: turning the problem
of locally optimal control in high-dimensional non-linear systems into one of identifying a
low-dimensional latent state space, in which locally optimal control can be performed robustly and
easily. To learn such a latent space we propose a new deep generative model belonging to the class
of variational autoencoders [?] that is derived from an iLQG formulation in latent space. The
resulting Embed to Control (E2C) system is a probabilistic generative model that holds a belief over 
viable trajectories in sensory space, allows for accurate long-term planning in latent space, and is 
trained fully unsupervised. We demonstrate the success of our approach on four challenging tasks 
for control from raw images and compare it to a range of methods for unsupervised representation 
learning. As an aside, we also validate that deep up-convolutional networks [?] are powerful
generative models for large images. 

* Authors contributed equally.

2 The Embed to Control (E2C) model


We briefly review the problem of SOC for dynamical systems, introduce approximate locally optimal 
control in latent space, and finish with the derivation of our model. 

2.1 Problem Formulation 

We consider the control of unknown dynamical systems of the form 

$$s_{t+1} = f(s_t, u_t) + \xi, \quad \xi \sim \mathcal{N}(0, \Sigma_\xi), \tag{1}$$

where $t$ denotes the time steps, $s_t \in \mathbb{R}^{n_s}$ the system state, $u_t \in \mathbb{R}^{n_u}$ the applied control and $\xi$
the system noise. The function $f(s_t, u_t)$ is an arbitrary, smooth, system dynamics. We equivalently
refer to Equation (1) using the notation $P(s_{t+1} \mid s_t, u_t)$, which we assume to be a multivariate normal
distribution $\mathcal{N}(f(s_t, u_t), \Sigma_\xi)$. We further assume that we are only given access to visual depictions
$x_t \in \mathbb{R}^{n_x}$ of state $s_t$. This restriction requires solving a joint state identification and control problem.
For simplicity we will in the following assume that $x_t$ is a fully observed depiction of $s_t$, but relax
this assumption later.

Our goal then is to infer a low-dimensional latent state space model in which optimal control can
be performed. That is, we seek to learn a function $m$, mapping from high-dimensional images $x_t$
to low-dimensional vectors $z_t \in \mathbb{R}^{n_z}$ with $n_z \ll n_x$, such that the control problem can be solved
using $z_t$ instead of $x_t$:

$$z_t = m(x_t) + \omega, \quad \omega \sim \mathcal{N}(0, \Sigma_\omega), \tag{2}$$

where $\omega$ accounts for system noise; or equivalently $z_t \sim \mathcal{N}(m(x_t), \Sigma_\omega)$. Assuming for the moment
that such a function can be learned (or approximated), we will first define SOC in a latent space and
introduce our model thereafter.


2.2 Stochastic locally optimal control in latent spaces 


Let $z_t \in \mathbb{R}^{n_z}$ be the inferred latent state from image $x_t$ of state $s_t$ and $f^{\mathrm{lat}}(z_t, u_t)$ the transition
dynamics in latent space, i.e., $z_{t+1} = f^{\mathrm{lat}}(z_t, u_t)$. Thus $f^{\mathrm{lat}}$ models the changes that occur in
$z_t$ when control $u_t$ is applied to the underlying system as a latent space analogue to $f(s_t, u_t)$.
Assuming $f^{\mathrm{lat}}$ is known, optimal controls for a trajectory of length $T$ in the dynamical system can
be derived by minimizing the function $J(z_{1:T}, u_{1:T})$ which gives the expected future costs when
following $(z_{1:T}, u_{1:T})$:

$$J(z_{1:T}, u_{1:T}) = \mathbb{E}_z\left[ c_T(z_T, u_T) + \sum_{t_0}^{T-1} c(z_t, u_t) \right], \tag{3}$$


where $c(z_t, u_t)$ are instantaneous costs, $c_T(z_T, u_T)$ denotes terminal costs and $z_{1:T} = \{z_1, \ldots, z_T\}$
and $u_{1:T} = \{u_1, \ldots, u_T\}$ are state and action sequences respectively. If $z_t$ contains sufficient
information about $s_t$, i.e., $s_t$ can be inferred from $z_t$ alone, and $f^{\mathrm{lat}}$ is differentiable, the cost-minimizing
controls can be computed from $J(z_{1:T}, u_{1:T})$ via SOC algorithms [10]. These optimal control
algorithms approximate the global non-linear dynamics with locally linear dynamics at each time step
$t$. Locally optimal actions can then be found in closed form. Formally, given a reference trajectory
$\bar{z}_{1:T}$ - the current estimate for the optimal trajectory - together with corresponding controls $\bar{u}_{1:T}$,
the system is linearized as

$$z_{t+1} = A(\bar{z}_t) z_t + B(\bar{z}_t) u_t + o(\bar{z}_t) + \omega, \quad \omega \sim \mathcal{N}(0, \Sigma_\omega), \tag{4}$$

where $A(\bar{z}_t) = \frac{\partial f^{\mathrm{lat}}(\bar{z}_t, \bar{u}_t)}{\partial \bar{z}_t}$ and $B(\bar{z}_t) = \frac{\partial f^{\mathrm{lat}}(\bar{z}_t, \bar{u}_t)}{\partial \bar{u}_t}$ are local Jacobians, and $o(\bar{z}_t)$ is an offset. To
enable efficient computation of the local controls we assume the costs to be a quadratic function of
the latent representation

$$c(z_t, u_t) = (z_t - z_{\mathrm{goal}})^T R_z (z_t - z_{\mathrm{goal}}) + u_t^T R_u u_t, \tag{5}$$

where $R_z \in \mathbb{R}^{n_z \times n_z}$ and $R_u \in \mathbb{R}^{n_u \times n_u}$ are cost weighting matrices and $z_{\mathrm{goal}}$ is the inferred
representation of the goal state. We also assume $c_T(z_T, u_T) = c(z_T, u_T)$ throughout this paper.
In combination with Equation (4) this gives us a local linear-quadratic-Gaussian formulation at
each time step $t$ which can be solved by SOC algorithms such as iterative linear-quadratic
regulation (iLQR) [?] or approximate inference control (AICO) [12]. The result of this trajectory
optimization step is a locally optimal trajectory with corresponding control sequence
$(z^*_{1:T}, u^*_{1:T}) \approx \arg\min_{z_{1:T}, u_{1:T}} J(z_{1:T}, u_{1:T})$.
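The local linearization of Equation (4) and the quadratic cost of Equation (5) can be illustrated with a small numerical sketch. It is not the authors' implementation: the Jacobians are approximated by central finite differences, and `f_toy`, `linearize` and `quadratic_cost` are hypothetical names introduced only for this example.

```python
import numpy as np

def linearize(f_lat, z_ref, u_ref, eps=1e-5):
    """Central finite-difference Jacobians of f_lat around the reference point."""
    n_z, n_u = z_ref.size, u_ref.size
    A = np.zeros((n_z, n_z))
    B = np.zeros((n_z, n_u))
    for i in range(n_z):
        dz = np.zeros(n_z)
        dz[i] = eps
        A[:, i] = (f_lat(z_ref + dz, u_ref) - f_lat(z_ref - dz, u_ref)) / (2 * eps)
    for j in range(n_u):
        du = np.zeros(n_u)
        du[j] = eps
        B[:, j] = (f_lat(z_ref, u_ref + du) - f_lat(z_ref, u_ref - du)) / (2 * eps)
    # Offset chosen so that the linear model reproduces f_lat at the reference point.
    o = f_lat(z_ref, u_ref) - A @ z_ref - B @ u_ref
    return A, B, o

def quadratic_cost(z, u, z_goal, R_z, R_u):
    """Quadratic cost in the form of Equation (5)."""
    d = z - z_goal
    return float(d @ R_z @ d + u @ R_u @ u)

def f_toy(z, u):
    """Hypothetical smooth latent dynamics, used only to exercise the code."""
    return np.array([z[0] + 0.1 * np.sin(z[1]), z[1] + 0.1 * u[0]])

z_ref = np.array([0.5, 0.3])
u_ref = np.array([0.2])
A, B, o = linearize(f_toy, z_ref, u_ref)
```

In the E2C model itself this finite-difference step is not needed, because the matrices are predicted directly by a network; the sketch only shows what a generic SOC solver would have to compute from a non-linear $f^{\mathrm{lat}}$.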









Figure 1: The information flow in the E2C model. From left to right, we encode and decode an
image $x_t$ with the networks $h^{\mathrm{enc}}_\phi$ and $h^{\mathrm{dec}}_\theta$, where we use the latent code $z_t$ for the transition step.
The $h^{\mathrm{trans}}_\psi$ network computes the local matrices $A_t$, $B_t$, $o_t$ with which we can predict $\hat{z}_{t+1}$ from $z_t$
and $u_t$. Similarity to the encoding $z_{t+1}$ is enforced by a KL divergence on their distributions and
reconstruction is again performed by $h^{\mathrm{dec}}_\theta$.


2.3 A locally linear latent state space model for dynamical systems 

Starting from the SOC formulation, we now turn to the problem of learning an appropriate
low-dimensional latent representation $z_t \sim P(Z_t \mid m(x_t), \Sigma_\omega)$ of $x_t$. The representation $z_t$ has to fulfill
three properties: (i) it must capture sufficient information about $x_t$ (enough to enable
reconstruction); (ii) it must allow for accurate prediction of the next latent state $z_{t+1}$ and thus, implicitly, of the
next observation $x_{t+1}$; (iii) the prediction $f^{\mathrm{lat}}$ of the next latent state must be locally linearizable for
all valid control magnitudes $u_t$. Given some representation $z_t$, properties (ii) and (iii) in particular
require us to capture possibly highly non-linear changes of the latent representation due to
transformations of the observed scene induced by control commands. Crucially, these are particularly hard
to model and subsequently linearize. We circumvent this problem by taking a more direct approach:
instead of learning a latent space $z$ and transition model $f^{\mathrm{lat}}$ which are then linearized and combined
with SOC algorithms, we directly impose desired transformation properties on the representation $z_t$
during learning. We will select these properties such that prediction in the latent space as well as
locally linear inference of the next observation according to Equation (4) are easy.

The transformation properties that we desire from a latent representation can be formalized directly
from the iLQG formulation given in Section 2.2. Formally, following Equation (2), let the latent
representation be Gaussian $P(Z \mid X) = \mathcal{N}(m(x_t), \Sigma_\omega)$. To infer $z_t$ from $x_t$ we first require a
method for sampling latent states. Ideally, we would generate samples directly from the unknown
true posterior $P(Z \mid X)$, to which, however, we have no access. Following the variational Bayes
approach (see Jordan et al. [?] for an overview) we resort to sampling $z_t$ from an approximate
posterior distribution $Q_\phi(Z \mid X)$ with parameters $\phi$.

Inference model for $Q_\phi$. In our work this is always a diagonal Gaussian distribution $Q_\phi(Z \mid X) =
\mathcal{N}(\mu_t, \mathrm{diag}(\sigma_t^2))$, whose mean $\mu_t \in \mathbb{R}^{n_z}$ and covariance $\Sigma_t = \mathrm{diag}(\sigma_t^2) \in \mathbb{R}^{n_z \times n_z}$ are computed
by an encoding neural network with outputs

$$\mu_t = W_\mu h^{\mathrm{enc}}_\phi(x_t) + b_\mu, \tag{6}$$

$$\log \sigma_t = W_\sigma h^{\mathrm{enc}}_\phi(x_t) + b_\sigma, \tag{7}$$

where $h^{\mathrm{enc}}_\phi \in \mathbb{R}^{n_e}$ is the activation of the last hidden layer and where $\phi$ is given by the set of all
learnable parameters of the encoding network, including the weight matrices $W_\mu$, $W_\sigma$ and biases
$b_\mu$, $b_\sigma$. Parameterizing the mean and variance of a Gaussian distribution based on a neural network
gives us a natural and very expressive model for our latent space. It additionally comes with the
benefit that we can use the reparameterization trick [6, 7] to backpropagate gradients of a loss
function based on samples through the latent distribution.
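The two output heads of Equations (6) and (7) and the reparameterized sampling step can be sketched as follows. This is only an illustration under assumed shapes: the layer sizes, the random weights and the helper names `encoder_heads` and `sample_latent` are hypothetical, and the hidden activation is a random stand-in rather than the output of a trained encoder.

```python
import numpy as np

def encoder_heads(h_enc, W_mu, b_mu, W_sigma, b_sigma):
    """Linear output heads of Equations (6) and (7) on the last hidden layer."""
    mu = W_mu @ h_enc + b_mu
    log_sigma = W_sigma @ h_enc + b_sigma
    return mu, log_sigma

def sample_latent(mu, log_sigma, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
n_e, n_z = 16, 2                      # hypothetical layer sizes
h_enc = rng.standard_normal(n_e)      # stand-in for the last hidden activation
W_mu, b_mu = 0.1 * rng.standard_normal((n_z, n_e)), np.zeros(n_z)
W_sigma, b_sigma = 0.1 * rng.standard_normal((n_z, n_e)), np.zeros(n_z)

mu_t, log_sigma_t = encoder_heads(h_enc, W_mu, b_mu, W_sigma, b_sigma)
z_t = sample_latent(mu_t, log_sigma_t, rng)
```

Because the noise `eps` is drawn independently of the parameters, the sample is a deterministic, differentiable function of `mu` and `log_sigma`, which is what allows gradients to pass through the sampling step.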

Generative model for $P_\theta$. Using the approximate posterior distribution $Q_\phi$ we generate observed
samples (images) $x_t$ and $x_{t+1}$ from latent samples $z_t$ and $\hat{z}_{t+1}$ by enforcing a locally linear
relationship in latent space according to Equation (4), yielding the following generative model

$$\begin{aligned}
z_t &\sim Q_\phi(Z \mid X) = \mathcal{N}(\mu_t, \Sigma_t), \\
\hat{z}_{t+1} &\sim \hat{Q}_\psi(\hat{Z} \mid Z, u) = \mathcal{N}(A_t \mu_t + B_t u_t + o_t, C_t), \\
x_t, x_{t+1} &\sim P_\theta(X \mid Z) = \mathrm{Bernoulli}(p_t),
\end{aligned} \tag{8}$$

where $\hat{Q}_\psi$ is the next latent state posterior distribution, which exactly follows the linear form
required for stochastic optimal control. With $\omega_t \sim \mathcal{N}(0, H_t)$ as an estimate of the system noise,
$C$ can be decomposed as $C_t = A_t \Sigma_t A_t^T + H_t$. Note that while the transition dynamics in our
generative model operates on the inferred latent space, it takes untransformed controls into account.
That is, we aim to learn a latent space such that the transition dynamics in $z$ linearizes the non-linear
observed dynamics in $x$ and is locally linear in the applied controls $u$. Reconstruction of an image
from $z_t$ is performed by passing the sample through multiple hidden layers of a decoding neural
network which computes the mean of the generative Bernoulli distribution¹ $P_\theta(X \mid Z)$ as

$$p_t = W_p h^{\mathrm{dec}}_\theta(z_t) + b_p, \tag{9}$$

where $h^{\mathrm{dec}}_\theta(z_t) \in \mathbb{R}^{n_d}$ is the response of the last hidden layer in the decoding network. The set of
parameters for the decoding network, including weight matrix $W_p$ and bias $b_p$, then make up the
learned generative parameters $\theta$.

Transition model for $\hat{Q}_\psi$. What remains is to specify how the linearization matrices $A_t \in \mathbb{R}^{n_z \times n_z}$,
$B_t \in \mathbb{R}^{n_z \times n_u}$ and offset $o_t \in \mathbb{R}^{n_z}$ are predicted. Following the same approach as for distribution
means and covariance matrices, we predict all local transformation parameters from samples $z_t$
based on the hidden representation $h^{\mathrm{trans}}_\psi(z_t) \in \mathbb{R}^{n_t}$ of a third neural network with parameters $\psi$ -
to which we refer as the transformation network. Specifically, we parametrize the transformation
matrices and offset as

$$\begin{aligned}
\mathrm{vec}[A_t] &= W_A h^{\mathrm{trans}}_\psi(z_t) + b_A, \\
\mathrm{vec}[B_t] &= W_B h^{\mathrm{trans}}_\psi(z_t) + b_B, \\
o_t &= W_o h^{\mathrm{trans}}_\psi(z_t) + b_o,
\end{aligned} \tag{10}$$

where vec denotes vectorization and therefore $\mathrm{vec}[A_t] \in \mathbb{R}^{(n_z^2)}$ and $\mathrm{vec}[B_t] \in \mathbb{R}^{(n_z \cdot n_u)}$. To
circumvent estimating the full matrix $A_t$ of size $n_z \times n_z$, we can choose it to be a perturbation of the
identity matrix $A_t = (I + v_t r_t^T)$ which reduces the parameters to be estimated for $A_t$ to $2 n_z$.
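A minimal sketch of one transition step with the rank-one parameterization may help to make the shapes concrete. It combines $A_t = I + v_t r_t^T$ with the mean and covariance of Equation (8). The function name `local_transition`, the random inputs, and in particular the choice of a diagonal noise covariance $H_t$ are assumptions made for this example only; the text does not specify the structure of $H_t$.

```python
import numpy as np

def local_transition(mu_t, sigma_sq_t, u_t, v_t, r_t, B_t, o_t, h_sq_t):
    """Mean and covariance of the predicted next latent state, as in Equation (8)."""
    n_z = mu_t.size
    A_t = np.eye(n_z) + np.outer(v_t, r_t)       # rank-one perturbation of the identity
    mean_next = A_t @ mu_t + B_t @ u_t + o_t
    # C_t = A_t Sigma_t A_t^T + H_t with diagonal Sigma_t and (assumed) diagonal H_t
    C_t = A_t @ np.diag(sigma_sq_t) @ A_t.T + np.diag(h_sq_t)
    return A_t, mean_next, C_t

rng = np.random.default_rng(0)
n_z, n_u = 3, 2                                  # hypothetical dimensions
mu_t = rng.standard_normal(n_z)
sigma_sq_t = rng.uniform(0.1, 1.0, n_z)          # diagonal of Sigma_t
h_sq_t = rng.uniform(0.01, 0.1, n_z)             # diagonal of H_t (assumption)
u_t = rng.standard_normal(n_u)
v_t, r_t = rng.standard_normal(n_z), rng.standard_normal(n_z)
B_t = rng.standard_normal((n_z, n_u))
o_t = rng.standard_normal(n_z)

A_t, mean_next, C_t = local_transition(mu_t, sigma_sq_t, u_t, v_t, r_t, B_t, o_t, h_sq_t)
```

Only `v_t` and `r_t` (together $2 n_z$ numbers) have to be predicted for $A_t$, and the product $A_t \mu_t$ can be evaluated as $\mu_t + v_t (r_t^T \mu_t)$ without forming the full matrix.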

A sketch of the complete architecture is shown in Figure 1. It also visualizes an additional constraint
that is essential for learning a representation for long-term predictions: we require samples $\hat{z}_{t+1}$
from the state transition distribution $\hat{Q}_\psi$ to be similar to the encoding of $x_{t+1}$ through $Q_\phi$. While it
might seem that just learning a perfect reconstruction of $x_{t+1}$ from $\hat{z}_{t+1}$ is enough, we require
multi-step predictions for planning in $Z$ which must correspond to valid trajectories in the observed space
$X$. Without enforcing similarity between samples from $\hat{Q}_\psi$ and $Q_\phi$, following a transition in latent
space from $z_t$ with action $u_t$ may lead to a point $\hat{z}_{t+1}$, from which reconstruction of $x_{t+1}$ is possible,
but that is not a valid encoding (i.e. the model will never encode any image as $\hat{z}_{t+1}$). Executing
another action in $\hat{z}_{t+1}$ then does not result in a valid latent state - since the transition model is
conditional on samples coming from the inference network - and thus long-term predictions fail.
In a nutshell, such a divergence between encodings and the transition model results in a generative
model that does not accurately model the Markov chain formed by the observations.


2.4 Learning via stochastic gradient variational Bayes 


For training the model we use a data set $\mathcal{D} = \{(x_1, u_1, x_2), \ldots, (x_{T-1}, u_{T-1}, x_T)\}$ containing
observation tuples with corresponding controls obtained from interactions with the dynamical system.
Using this data set, we learn the parameters of the inference, transition and generative model by
minimizing a variational bound on the true data negative log-likelihood $-\log P(x_t, u_t, x_{t+1})$ plus
an additional constraint on the latent representation. The complete loss function² is given as

$$\mathcal{L}(\mathcal{D}) = \sum_{(x_t, u_t, x_{t+1}) \in \mathcal{D}} \mathcal{L}^{\mathrm{bound}}(x_t, u_t, x_{t+1}) + \lambda \, \mathrm{KL}\left(\hat{Q}_\psi(\hat{Z} \mid \mu_t, u_t) \,\|\, Q_\phi(Z \mid x_{t+1})\right). \tag{11}$$


The first part of this loss is the per-example variational bound on the log-likelihood 

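$$\mathcal{L}^{\mathrm{bound}}(x_t, u_t, x_{t+1}) = \mathbb{E}_{z_t \sim Q_\phi,\, \hat{z}_{t+1} \sim \hat{Q}_\psi}\left[-\log P_\theta(x_t \mid z_t) - \log P_\theta(x_{t+1} \mid \hat{z}_{t+1})\right] + \mathrm{KL}(Q_\phi \,\|\, P(Z)), \tag{12}$$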


where $Q_\phi$, $P_\theta$ and $\hat{Q}_\psi$ are the parametric inference, generative and transition distributions from
Section 2.3 and $P(Z_t)$ is a prior on the approximate posterior which we always chose to be
an isotropic Gaussian distribution with mean zero and unit variance. The second KL divergence in
Equation (11) is an additional contraction term with weight $\lambda$, that enforces agreement between the
transition and inference models. This term is essential for establishing a Markov chain in latent space
that corresponds to the real system dynamics (see Section 2.3 above for an in-depth discussion). This
KL divergence can also be seen as a prior on the latent transition model. Note that all KL terms can
be computed analytically for our model (see supplementary for details).

¹ A Bernoulli distribution for $P_\theta$ is a common choice when modeling black-and-white images.
² Note that this is the loss for the latent state space model and distinct from the SOC costs.

During training we approximate the expectation in $\mathcal{L}(\mathcal{D})$ via sampling. Specifically, we take one
sample $z_t$ for each input and transform that sample using Equation (10) to give a valid sample
$\hat{z}_{t+1}$ from $\hat{Q}_\psi$. We then jointly learn all parameters of our model by minimizing $\mathcal{L}(\mathcal{D})$ using SGD.
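For a single training tuple, the loss of Equations (11) and (12) can be assembled from three analytic pieces: a Bernoulli negative log-likelihood, the KL divergence of a diagonal Gaussian to the unit-variance prior, and a KL divergence between the predicted and the encoded next latent state. The sketch below is a plain NumPy illustration with one sample per input, not the authors' training code. All function names are hypothetical, the value of `lam` is a placeholder, and the decoder outputs `p_t`, `p_next` are assumed to already be valid Bernoulli means in (0, 1).

```python
import numpy as np

def bernoulli_nll(x, p, eps=1e-7):
    """Negative log-likelihood of pixels x under independent Bernoulli means p."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p)))

def kl_diag_to_standard_normal(mu, sigma_sq):
    """KL( N(mu, diag(sigma_sq)) || N(0, I) )."""
    return float(0.5 * np.sum(sigma_sq + mu**2 - 1.0 - np.log(sigma_sq)))

def kl_gaussians(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for full covariance matrices."""
    k = mu0.size
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return float(0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - k + logdet1 - logdet0))

def e2c_loss(x_t, x_next, p_t, p_next, mu_t, sigma_sq_t,
             mu_hat, C_t, mu_next, sigma_sq_next, lam):
    """Per-tuple loss: variational bound plus weighted consistency KL."""
    bound = (bernoulli_nll(x_t, p_t) + bernoulli_nll(x_next, p_next)
             + kl_diag_to_standard_normal(mu_t, sigma_sq_t))
    consistency = kl_gaussians(mu_hat, C_t, mu_next, np.diag(sigma_sq_next))
    return bound + lam * consistency
```

Here `mu_hat` and `C_t` describe the transition distribution and `mu_next`, `sigma_sq_next` the encoding of the next image; in a real implementation these quantities would come from the three networks and the whole expression would be differentiated by an automatic differentiation framework.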


3 Experimental Results 

We evaluate our model on four visual tasks: an agent in a plane with obstacles, a visual version of the 
classic inverted pendulum swing-up task, balancing a cart-pole system, and control of a three-link 
arm with larger images. These are described in detail below. 


3.1 Experimental Setup 


Model training. We consider two different network types for our model: standard fully connected
neural networks with up to three layers, which work well for moderately sized images, are used for
the planar and swing-up experiments; a deep convolutional network for the encoder in combination
with an up-convolutional network as the decoder which, in accordance with recent findings from
the literature [?], we found to be an adequate model for larger images. Training was performed
using Adam [14] throughout all experiments. The training data set $\mathcal{D}$ for all tasks was generated by
randomly sampling $N$ state observations and actions with corresponding successor states. For the
plane we used $N = 3{,}000$ samples, for the inverted pendulum and cart-pole system we used $N =
15{,}000$ and for the arm $N = 30{,}000$. A complete list of architecture parameters and hyperparameter
choices as well as an in-depth explanation of the up-convolutional network are specified in the
supplementary material. We will make our code and a video containing controlled trajectories for all
systems available under http://ml.informatik.uni-freiburg.de/research/e2c .
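The data collection described above (random states and actions paired with successor observations) can be sketched generically. The simulator interface below is entirely hypothetical: `sample_state`, `sample_action`, `step` and `render` are stand-ins, and the toy system is a point on a line rendered as an 8-pixel one-hot image, chosen only so that the example runs on its own.

```python
import numpy as np

def make_dataset(sample_state, sample_action, step, render, n_samples, rng):
    """Collect tuples (x_t, u_t, x_{t+1}) from randomly sampled states and actions."""
    X, U, X_next = [], [], []
    for _ in range(n_samples):
        s = sample_state(rng)
        u = sample_action(rng)
        s_next = step(s, u)
        X.append(render(s))
        U.append(u)
        X_next.append(render(s_next))
    return np.stack(X), np.stack(U), np.stack(X_next)

# Toy stand-ins (hypothetical): a point moving on the interval [0, 7].
def sample_state(rng):
    return rng.uniform(0.0, 7.0, size=1)

def sample_action(rng):
    return rng.uniform(-1.0, 1.0, size=1)

def step(s, u):
    return np.clip(s + u, 0.0, 7.0)

def render(s):
    img = np.zeros(8)
    img[int(round(float(s[0])))] = 1.0   # nearest pixel to the point's position
    return img

X, U, X_next = make_dataset(sample_state, sample_action, step, render,
                            n_samples=100, rng=np.random.default_rng(0))
```

Because each tuple is sampled independently, the data set carries no temporal ordering beyond the single transition inside each tuple.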


Model variants. In addition to the Embed to Control (E2C) dynamics model derived above, we
also consider two variants: By removing the latent dynamics network $h^{\mathrm{trans}}_\psi$, i.e. setting its output
to one in Equation (10), we obtain a variant in which $A_t$, $B_t$ and $o_t$ are estimated as globally
linear matrices (Global E2C). If we instead replace the transition model with a network estimating
the dynamics as a non-linear function $\hat{f}^{\mathrm{lat}}$ and only linearize during planning, estimating $A_t$, $B_t$, $o_t$
as Jacobians to $\hat{f}^{\mathrm{lat}}$ as described in Section 2.2, we obtain a variant with non-linear latent dynamics.


Baseline models. For a thorough comparison and to exhibit the complicated nature of the tasks,
we also test a set of baseline models on the plane and the inverted pendulum task (using the same
architecture as the E2C model): a standard variational autoencoder (VAE) and a deep autoencoder
(AE) are trained on the autoencoding subtask for visual problems. That is, given a data set $\mathcal{D}$
used for training our model, we remove all actions from the tuples in $\mathcal{D}$ and disregard temporal
context between images. After autoencoder training we learn a dynamics model in latent space,
approximating $f^{\mathrm{lat}}$ from Section 2.2. We also consider a VAE variant with a slowness term on the
latent representation - a full description of this variant is given in the supplementary material.


Figure 2: The true state space of the planar system (left) with examples (obstacles encoded as circles)
and the inferred spaces (right) of different models. The spaces are spanned by generating images for
every valid position of the agent and embedding them with the respective encoders.

Optimal control algorithms. To perform optimal control in the latent space of different models,
we employ two trajectory optimization algorithms: iterative linear quadratic regulation (iLQR) [?]
(for the plane and inverted pendulum) and approximate inference control (AICO) [12] (all other
experiments). For all VAEs both methods operate on the mean of distributions $Q_\phi$ and $\hat{Q}_\psi$. AICO
additionally makes use of the local Gaussian covariances $\Sigma_t$ and $C_t$. Except for the experiments
on the planar system, control was performed in a model predictive control fashion using the
receding horizon scheme introduced in [3]. To obtain closed loop control given an image $x_t$, it is first
passed through the encoder to obtain the latent state $z_t$. A locally optimal trajectory is subsequently
found by optimizing $(z^*_{t:t+T}, u^*_{t:t+T}) \approx \arg\min_{z_{t:t+T}, u_{t:t+T}} J(z_{t:t+T}, u_{t:t+T})$ with fixed, small horizon
$T$ (with $T = 10$ unless noted otherwise). Controls are applied to the system and a transition to
$z_{t+1}$ is observed (by encoding the next image $x_{t+1}$). Then a new control sequence - with horizon
$T$ - starting in $z_{t+1}$ is found using the last estimated trajectory as a bootstrap. Note that planning
is performed entirely in the latent state without access to any observations except for the depiction
of the current state. To compute the cost function $c(z_t, u_t)$ required for trajectory optimization in
$z$ we assume knowledge of the observation $x_{\mathrm{goal}}$ of the goal state $s_{\mathrm{goal}}$. This observation is then
transformed into latent space and costs are computed according to Equation (5).
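The receding horizon loop can be sketched with a deliberately simplified planner. Instead of iLQR or AICO, the example below uses a finite-horizon LQR solved by a backward Riccati recursion, which is the special case of a single, fixed linear model. Further assumptions made only for this illustration: the goal is a fixed point of the uncontrolled model (so the error dynamics are linear), the "real" system is identical to the model, and encoding the next image is replaced by reading the next state directly. All names are hypothetical.

```python
import numpy as np

def lqr_first_gain(A, B, R_z, R_u, horizon):
    """Backward Riccati recursion; returns the feedback gain of the first step."""
    P = R_z.copy()                                   # terminal cost equals stage cost
    K = None
    for _ in range(horizon):
        K = np.linalg.solve(R_u + B.T @ P @ B, B.T @ P @ A)
        P = R_z + A.T @ P @ (A - B @ K)
    return K

def mpc_rollout(z0, z_goal, A, B, R_z, R_u, horizon=10, n_steps=200):
    """Plan over a short horizon, apply only the first control, then re-plan."""
    z = z0.copy()
    traj = [z.copy()]
    for _ in range(n_steps):
        K = lqr_first_gain(A, B, R_z, R_u, horizon)  # re-planning step
        u = -K @ (z - z_goal)                        # first control of the plan
        z = A @ z + B @ u                            # stand-in for acting and re-encoding
        traj.append(z.copy())
    return np.array(traj)

# Toy latent model (hypothetical): a double integrator with step size 0.1.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
R_z, R_u = np.eye(2), 0.1 * np.eye(1)
trajectory = mpc_rollout(np.array([0.0, 0.0]), np.array([2.0, 0.0]), A, B, R_z, R_u)
```

In this toy setting the model never changes, so every re-planning step yields the same gain. With locally linear dynamics the matrices depend on the current latent state, which is why the plan has to be recomputed after each observed transition.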


3.2 Control in a planar system 

The agent in the planar system can move in a bounded two-dimensional plane by choosing a
continuous offset in x- and y-direction. The high-dimensional representation of a state is a 40 × 40
black-and-white image. Obstructed by six circular obstacles, the task is to move to the bottom right
of the image, starting from a random x position at the top of the image. The encodings of obstacles
are obtained prior to planning and an additional quadratic cost term penalizes proximity to them.
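To make the observation model concrete, the following sketch renders an agent position as a 40 × 40 binary image and applies a bounded offset. The rendering details are assumptions for illustration: the agent is drawn as a small filled disc of hypothetical radius, positions are given as (row, column), and the six obstacles are omitted because their locations are not given in the text.

```python
import numpy as np

def render_plane(pos, size=40, radius=2.0):
    """Binary image with a filled disc (the agent) centered at pos = (row, col)."""
    rows, cols = np.mgrid[0:size, 0:size]
    mask = (rows - pos[0]) ** 2 + (cols - pos[1]) ** 2 <= radius ** 2
    return mask.astype(np.float32)

def step_plane(pos, u, size=40):
    """Apply a continuous offset u and keep the agent inside the bounded plane."""
    return np.clip(pos + u, 0.0, size - 1.0)

pos = np.array([5.0, 12.0])
x_t = render_plane(pos)
x_next = render_plane(step_plane(pos, np.array([1.5, -0.5])))
```

Flattening `x_t` gives a 1600-dimensional observation vector, which is the kind of input a fully connected encoder would receive for this task.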

A depiction of the observations on which control is performed - together with their corresponding
state values and embeddings into latent space - is shown in Figure 2. The figure also clearly shows
a fundamental advantage the E2C model has over its competitors: While the separately trained
autoencoders make for aesthetically pleasing pictures, the models failed to discover the underlying
structure of the state space, complicating dynamics estimation and largely invalidating costs based
on distances in said space. Including the latent dynamics constraints in these end-to-end models,
on the other hand, yields latent spaces approaching the optimal planar embedding.

We test the long-term accuracy by accumulating latent and real trajectory costs to quantify whether
the imagined trajectory reflects reality. The results for all models when starting from random
positions at the top and executing 40 pre-computed actions are summarized in Table 1, using a separate
test set for evaluating reconstructions. While all methods achieve a low reconstruction loss, the
difference in accumulated real costs per trajectory shows the superiority of the E2C model. Using the
globally or locally linear E2C model, trajectories planned in latent space are as good as trajectories
planned on the real state. All models besides E2C fail to give long-term predictions that result in
good performance.


3.3 Learning swing-up for an inverted pendulum 

We next turn to the task of controlling the classical inverted pendulum system [15] from images.
We create depictions of the state by rendering a fixed length line starting from the center of the
image at an angle corresponding to the pendulum position. The goal in this task is to swing up and
balance an underactuated pendulum from a resting position (pendulum hanging down). Exemplary
observations and reconstructions for this system are given in Figure 3(d). In the visual inverted
pendulum task our algorithm faces two additional difficulties: first, the observed space is non-Markov, as
the angular velocity cannot be inferred from a single image, and second, discretization errors due to
rendering pendulum angles as small 48 × 48 pixel images make exact control difficult. To restore the
Markov property, we stack two images (as input channels), thus observing a one-step history.
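The effect of stacking two frames can be shown with a toy renderer. The sketch below draws the pendulum as a line from the image center and stacks the previous and current frame as two channels. Line length, the sampling density along the line, the angle convention (zero meaning straight down) and the channel order are all assumptions of this example.

```python
import numpy as np

def render_pendulum(theta, size=48, length=20):
    """Draw a fixed-length line from the image center at angle theta (0 = down)."""
    img = np.zeros((size, size), dtype=np.float32)
    center = (size - 1) / 2.0
    for r in np.linspace(0.0, length, 4 * length):
        row = int(round(center + r * np.cos(theta)))
        col = int(round(center + r * np.sin(theta)))
        img[row, col] = 1.0
    return img

def observe(theta_prev, theta_now):
    """Two-channel observation: previous frame and current frame."""
    return np.stack([render_pendulum(theta_prev), render_pendulum(theta_now)])

moving = observe(0.0, 0.1)    # pendulum arrived at 0.1 rad from 0.0 rad
resting = observe(0.1, 0.1)   # pendulum at 0.1 rad that did not move
```

The current frames of `moving` and `resting` are identical, so a single image cannot distinguish the two situations; only the stacked pair reveals the difference in angular velocity.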

Figure 3 shows the topology of the latent space for our model, as well as one sample trajectory in
true state and latent space. The fact that the model can learn a meaningful embedding, separating
velocities and positions, from this data is remarkable (no other model recovered this shape). Table 1
again compares the different models quantitatively. While the E2C model is not the best in terms of
reconstruction performance, it is the only model resulting in stable swing-up and balance behavior.
We explain the failure of the other models with the fact that the non-linear latent dynamics model
cannot be guaranteed to be linearizable for all control magnitudes, resulting in undesired behavior
around unstable fixpoints of the real system dynamics, and that for this task a globally linear
dynamics model is inadequate.

Table 1: Comparison between different approaches to model learning from raw pixels for the planar
and pendulum system. We compare all models with respect to their prediction quality on a test set
of sampled transitions and with respect to their performance when combined with SOC (trajectory
cost for control from different start states). Note that trajectory costs in latent space are not
necessarily comparable. The “real” trajectory cost was computed on the dynamics of the simulator while
executing planned actions. For the true models for $s_t$, real trajectory costs were 20.24 ± 4.15 for the
planar system, and 9.8 ± 2.4 for the pendulum. Success was defined as reaching the goal state and
staying ε-close to it for the rest of the trajectory (if non terminating). All statistics quantify over 5/30
(plane/pendulum) different starting positions. A † marks separately trained dynamics networks.

| Algorithm | State Loss $\log p(x_t \mid \hat{x}_t)$ | Next State Loss $\log p(x_{t+1} \mid \hat{x}_t, u_t)$ | Trajectory Cost (Latent) | Trajectory Cost (Real) | Success (percent) |
|---|---|---|---|---|---|
| Planar System | | | | | |
| AE† | 11.5 ± 97.8 | 3538.9 ± 1395.2 | 1325.6 ± 81.2 | 273.3 ± 16.4 | 0 % |
| VAE† | 3.6 ± 18.9 | 652.1 ± 930.6 | 43.1 ± 20.8 | 91.3 ± 16.4 | 0 % |
| VAE + slowness† | 10.5 ± 22.8 | 104.3 ± 235.8 | 47.1 ± 20.5 | 89.1 ± 16.4 | 0 % |
| Non-linear E2C | 8.3 ± 5.5 | 11.3 ± 10.1 | 19.8 ± 9.8 | 42.3 ± 16.4 | 96.6 % |
| Global E2C | 6.9 ± 3.2 | 9.3 ± 4.6 | 12.5 ± 3.9 | 27.3 ± 9.7 | 100 % |
| E2C | 7.7 ± 2.0 | 9.7 ± 3.2 | 10.3 ± 2.8 | 25.1 ± 5.3 | 100 % |
| Inverted Pendulum Swing-Up | | | | | |
| AE† | 8.9 ± 100.3 | 13433.8 ± 6238.8 | 1285.9 ± 355.8 | 194.7 ± 44.8 | 0 % |
| VAE† | 7.5 ± 47.7 | 8791.2 ± 17356.9 | 497.8 ± 129.4 | 237.2 ± 41.2 | 0 % |
| VAE + slowness† | 26.5 ± 18.0 | 779.7 ± 633.3 | 419.5 ± 85.8 | 188.2 ± 43.6 | 0 % |
| E2C no latent KL | 64.4 ± 32.8 | 87.7 ± 64.2 | 489.1 ± 87.5 | 213.2 ± 84.3 | 0 % |
| Non-linear E2C | 59.6 ± 25.2 | 72.6 ± 34.5 | 313.3 ± 65.7 | 37.4 ± 12.4 | 63.33 % |
| Global E2C | 115.5 ± 56.9 | 125.3 ± 62.6 | 628.1 ± 45.9 | 125.1 ± 10.7 | 0 % |
| E2C | 84.0 ± 50.8 | 89.3 ± 42.9 | 275.0 ± 16.6 | 15.4 ± 3.4 | 90 % |


3.4 Balancing a cart-pole and controlling a simulated robot arm 

Finally, we consider control of two more complex dynamical systems from images using a six layer
convolutional inference and six layer up-convolutional generative network, resulting in a 12-layer
deep path from input to reconstruction. Specifically, we control a visual version of the classical
cart-pole system [?] from a history of two 80 × 80 pixel images as well as a three-link planar robot arm
based on a history of two 128 × 128 pixel images. The latent space was set to be 8-dimensional in
both experiments. The real state dimensionality for the cart-pole is four and it is controlled using one
action, while for the arm the real state can be described in 6 dimensions (joint angles and velocities)
and controlled using a three-dimensional action vector corresponding to motor torques.

Figure 3: (a) The true state space of the inverted pendulum task overlaid with a successful trajectory
taken by the E2C agent. (b) The learned latent space. (c) The trajectory from (a) traced out in the
latent space. (d) Images $x$ and reconstructions $\hat{x}$ showing current positions (right) and history (left).

Figure 4: Left: Trajectory from the cart-pole domain. Only the first image (green) is “real”, all
other images are “dreamed up” by our model. Notice discretization artifacts present in the real
image. Right: Exemplary observed (with history image omitted) and predicted images (including
the history image) for a trajectory in the visual robot arm domain with the goal marked in red.

As in previous experiments the E2C model seems to have no problem finding a locally linear
embedding of images into latent space in which control can be performed. Figure 4 depicts exemplary
images - for both problems - from a trajectory executed by our system. The costs for these
trajectories (11.13 for the cart-pole, 85.12 for the arm) are only slightly worse than trajectories obtained
by AICO operating on the real system dynamics starting from the same start-state (7.28 and 60.74
respectively). The supplementary material contains additional experiments using these domains.


4 Comparison to recent work 

In the context of representation learning for control (see Böhmer et al. [?] for a review), deep
autoencoders (ignoring state transitions) similar to our baseline models have been applied previously,
e.g. by Lange and Riedmiller [?]. A more direct route to control based on image streams is taken
by recent work on (model free) deep end-to-end Q-learning for Atari games by Mnih et al. [19], as
well as kernel based [20] and deep policy learning for robot control [21].

Close to our approach is a recent paper by Wahlström et al. [22], where autoencoders are used to
extract a latent representation for control from images, on which a non-linear model of the forward
dynamics is learned. Their model is trained jointly and is thus similar to the non-linear E2C variant
in our comparison. In contrast to our model, their formulation requires PCA pre-processing and
ensures neither that long-term predictions in latent space do not diverge, nor that they are linearizable.

As stated above, our system belongs to the family of VAEs and is generally similar to recent work
such as Kingma and Welling [?], Rezende et al. [?], Gregor et al. [23], Bayer and Osendorfer [24].
Two additional parallels between our work and recent advances for training deep neural networks
can be observed. First, the idea of enforcing desired transformations in latent space during learning
- such that the data becomes easy to model - has appeared several times already in the literature.
This includes the development of transforming auto-encoders [25] and recent probabilistic models
for images [26, 27]. Second, learning relations between pairs of images - although without control -
has received considerable attention from the community during the last years [28, 29]. In a broader
context our model is related to work on state estimation in Markov decision processes (see Langford
et al. [30] for a discussion) through, e.g., hidden Markov models and Kalman filters [?].

5 Conclusion 

We presented Embed to Control (E2C), a system for stochastic optimal control on high-dimensional 
image streams. Key to the approach is the extraction of a latent dynamics model which is constrained 
to be locally linear in its state transitions. An evaluation on four challenging benchmarks revealed 
that E2C can find embeddings on which control can be performed with ease, reaching performance 
close to that achievable by optimal control on the real system model.

A Supplementary to the E2C description

A.1 State transition matrix factorization and KL Divergence

As alluded to in the main paper, estimation of the full local state transition matrix
$A_t \in \mathbb{R}^{n_z \times n_z}$ from Equation (8) requires the transition network to predict
$n_z \times n_z$ parameters. Using an arbitrary state transition matrix also, inconveniently, requires
inversion of said matrix for computing the KL divergence penalty from Equation (11) (through which
it is hard to backpropagate). We started our experiments using a full matrix (and only approximating
all KL divergence terms), but quickly found that a rank-one perturbation of the identity matrix could
be used instead without loss of performance in any of our benchmarks. On the contrary, the resulting
networks have fewer parameters and are thus easier to train. We give here the derivation of this
reformulation and show how the KL divergence from Equation (11) can be computed. For the
reformulation we represent $A_t$ as $A_t = I + v_t r_t^T$; therefore only $v_t$ and $r_t$ need to be
estimated by the transition network, reducing the number of outputs for $A_t$ from $n_z^2$ to $2 n_z$.
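As an illustration of why the rank-one form is cheap, the following minimal Python sketch applies $A_t = I + v_t r_t^T$ to a latent vector without ever building the $n_z \times n_z$ matrix. The function names `apply_rank_one` and `dense_rank_one` are hypothetical and chosen here for illustration only; the dense version exists purely as a comparison.

```python
def apply_rank_one(v, r, z):
    """Return (I + v r^T) z using O(n_z) work instead of O(n_z^2)."""
    # r^T z is a scalar, so A z = z + v * (r^T z).
    rz = sum(ri * zi for ri, zi in zip(r, z))
    return [zi + vi * rz for vi, zi in zip(v, z)]


def dense_rank_one(v, r):
    """Build the full matrix I + v r^T (for comparison only)."""
    n = len(v)
    return [[(1.0 if i == j else 0.0) + v[i] * r[j] for j in range(n)]
            for i in range(n)]
```

Both routes give the same product; only the first scales linearly in the latent dimensionality.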

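The KL divergence between two multivariate Gaussians is given by

$$\mathrm{KL}(\mathcal{N}_0 \| \mathcal{N}_1) = \frac{1}{2}\left(\mathrm{Tr}\left(\Sigma_1^{-1}\Sigma_0\right) + (\mu_1 - \mu_0)^T \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \log\frac{\det\Sigma_1}{\det\Sigma_0}\right). \qquad (13)$$

For a simplified notation, such that $\mathrm{KL}(\mathcal{N}_0 \| \mathcal{N}_1) = \mathrm{KL}(\hat{Q}_\psi \| Q_\phi)$, let us assume

$$\mathcal{N}_0 = \mathcal{N}(\mu_0, A \Sigma_0 A^T) = \mathcal{N}(\mu_t, A_t \Sigma_t A_t^T) = \hat{Q}_\psi,$$

$$\mathcal{N}_1 = \mathcal{N}(\mu_1, \Sigma_1) = \mathcal{N}(\mu_{t+1}, \Sigma_{t+1}) = Q_\phi.$$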


The main point behind the derivation presented in the following is to make partial derivatives of the
above KL divergence efficiently computable. To this end, we cannot take the trace or the determinant
via numerical algorithms, because we have to be able to take the gradients in symbolic form. Aside
from that, we would like to process a batch of samples, so the computation should have a convenient
form and not require excessive amounts of tensor products in between. We start our simplification
with the trace term, which results in


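$$
\begin{aligned}
\mathrm{Tr}\left(\Sigma_1^{-1}\Sigma_0\right)
&= \mathrm{Tr}\left(\Sigma_1^{-1} A \Sigma_0 A^T\right) \\
&= \mathrm{Tr}\left(\Sigma_1^{-1}(I + v r^T)\Sigma_0 (I + v r^T)^T\right) \\
&= \mathrm{Tr}\left(\left(\Sigma_1^{-1} + \Sigma_1^{-1} v r^T\right)\left(\Sigma_0 + \Sigma_0 (v r^T)^T\right)\right) \\
&= \mathrm{Tr}\left(\Sigma_1^{-1}\Sigma_0 + \Sigma_1^{-1}\Sigma_0 (v r^T)^T + \Sigma_1^{-1} v r^T \Sigma_0 + \Sigma_1^{-1} v r^T \Sigma_0 (v r^T)^T\right) \\
&= \mathrm{Tr}\left(\Sigma_1^{-1}\Sigma_0\right) + \mathrm{Tr}\left(\Sigma_1^{-1}\Sigma_0 (v r^T)^T\right) + \mathrm{Tr}\left(\Sigma_1^{-1} v r^T \Sigma_0\right) + \mathrm{Tr}\left(\Sigma_1^{-1} v r^T \Sigma_0 r v^T\right) && [\mathrm{Tr}(A + B) = \mathrm{Tr}(A) + \mathrm{Tr}(B)] \\
&= \sum_i \frac{\sigma_{0,i}^2}{\sigma_{1,i}^2} + 2\sum_i \frac{\sigma_{0,i}^2 r_i v_i}{\sigma_{1,i}^2} + \mathrm{Tr}\left(v^T \Sigma_1^{-1} v \, r^T \Sigma_0 r\right) && [\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB)] \\
&= \sum_i \frac{\sigma_{0,i}^2 + 2\sigma_{0,i}^2 v_i r_i}{\sigma_{1,i}^2} + \left(\sum_i \frac{v_i^2}{\sigma_{1,i}^2}\right)\left(\sum_i r_i^2 \sigma_{0,i}^2\right).
\end{aligned}
$$

In the first step the covariance of $\mathcal{N}_0$ in Equation (13) is replaced by $A \Sigma_0 A^T$. Both
$\Sigma_0$ and $\Sigma_1$ are diagonal, with standard deviations $\sigma_{0,i}$ and $\sigma_{1,i}$ on their diagonals.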

The last equation is easy to implement and only requires summing over the non-batch dimension.
The difference of means can be derived very quickly with the same summing scheme:

$$(\mu_1 - \mu_0)^T \Sigma_1^{-1} (\mu_1 - \mu_0) = \sum_i \frac{(\mu_{1,i} - \mu_{0,i})^2}{\sigma_{1,i}^2}.$$

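What remains is the ratio of determinants, which we simplify with the matrix determinant lemma,
giving

$$
\begin{aligned}
\log\frac{\det\Sigma_1}{\det A \Sigma_0 A^T}
&= \log\det\Sigma_1 - \log\det\left(A \Sigma_0 A^T\right) \\
&= \log\prod_i \sigma_{1,i}^2 - \log\left(\det A \cdot \det\Sigma_0 \cdot \det A^T\right) \\
&= 2\sum_i \log\sigma_{1,i} - \log\left((\det A)^2 \prod_i \sigma_{0,i}^2\right) && [\det A^T = \det A] \\
&= 2\sum_i \log\sigma_{1,i} - \log\left(1 + v^T r\right)^2 - 2\sum_i \log\sigma_{0,i} && [\text{matrix determinant lemma}] \\
&= 2\left(\sum_i \left(\log\sigma_{1,i} - \log\sigma_{0,i}\right) - \log\left(1 + \sum_i v_i r_i\right)\right).
\end{aligned}
$$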

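Putting the above formulas together finally yields

$$
\begin{aligned}
\mathrm{KL}(\mathcal{N}_0 \| \mathcal{N}_1) = \frac{1}{2}\Bigg(
&\sum_i \frac{\sigma_{0,i}^2 + 2\sigma_{0,i}^2 v_i r_i}{\sigma_{1,i}^2}
+ \left(\sum_i \frac{v_i^2}{\sigma_{1,i}^2}\right)\left(\sum_i r_i^2 \sigma_{0,i}^2\right)
+ \sum_i \frac{(\mu_{1,i} - \mu_{0,i})^2}{\sigma_{1,i}^2} - k \\
&+ 2\left(\sum_i \left(\log\sigma_{1,i} - \log\sigma_{0,i}\right) - \log\left(1 + \sum_i v_i r_i\right)\right)\Bigg). \qquad (14)
\end{aligned}
$$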
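The closed form can be checked numerically. The sketch below is a minimal NumPy version of Equation (14), assuming diagonal covariances parameterized by standard deviations and a batch stored along the first axis; the function names and argument order are hypothetical. It takes the absolute value of $1 + v^T r$ inside the logarithm (equivalent to the squared form in the derivation) so that the expression stays defined when that quantity is negative. The second function is a dense single-sample reference used only for comparison.

```python
import numpy as np


def kl_rank_one(mu0, sigma0, v, r, mu1, sigma1):
    """KL(N(mu0, A diag(sigma0^2) A^T) || N(mu1, diag(sigma1^2))), A = I + v r^T.

    All arguments have shape (batch, k); sigma0 and sigma1 hold standard
    deviations. Returns an array of shape (batch,).
    """
    k = mu0.shape[-1]
    var0, var1 = sigma0 ** 2, sigma1 ** 2
    # Trace term: only sums over the non-batch dimension are needed.
    trace = ((var0 + 2.0 * var0 * v * r) / var1).sum(-1)
    trace += (v ** 2 / var1).sum(-1) * (r ** 2 * var0).sum(-1)
    # Difference of means.
    maha = ((mu1 - mu0) ** 2 / var1).sum(-1)
    # Log-determinant ratio via the matrix determinant lemma.
    logdet = 2.0 * ((np.log(sigma1) - np.log(sigma0)).sum(-1)
                    - np.log(np.abs(1.0 + (v * r).sum(-1))))
    return 0.5 * (trace + maha - k + logdet)


def kl_dense(mu0, sigma0, v, r, mu1, sigma1):
    """Reference for a single sample using explicit matrices."""
    k = mu0.shape[0]
    A = np.eye(k) + np.outer(v, r)
    S0 = A @ np.diag(sigma0 ** 2) @ A.T
    S1_inv = np.diag(1.0 / sigma1 ** 2)
    d = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    logdet1 = 2.0 * np.log(sigma1).sum()
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - k + logdet1 - logdet0)
```

The batched version never forms a $k \times k$ matrix, which is the property the derivation is after.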


B Supplementary to the experimental setup 

B.1 Up-convolution

We used convolutional inference networks for the cart-pole and three-link arm task. While these
networks help us overcome the problem of large input dimensionalities (i.e. $2 \times 128 \times 128$ pixel
images in the three-link arm task), we still have to generate full-resolution images with the decoder
network. For high-dimensional image generation, fully connected neural networks are simply not an
option. We thus decided to use up-convolutional networks, which were recently shown to be powerful
models for image generation.

To set up these models we basically “mirror” the convolutional architecture used for the encoder.
More specifically, for each $5 \times 5$ convolution followed by a $2 \times 2$ max-pooling step in the encoder
network, we introduce a $2 \times 2$ up-sampling and $5 \times 5$ convolution step in the decoder network.
The complete network architecture is given below. It is similar to the up-convolution networks used
in Dosovitskiy et al. The upsampling strategy we use is simple “perforated” upsampling.
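As a rough illustration of the up-sampling step, the sketch below assumes that “perforated” upsampling means copying each input value into the top-left corner of its $2 \times 2$ output block and filling the remaining positions with zeros. This reading is an assumption made here for illustration, and the function name `perforated_upsample` is hypothetical.

```python
def perforated_upsample(image, factor=2):
    """Upsample a 2-D list of numbers by `factor`, zero-filling new positions."""
    height, width = len(image), len(image[0])
    out = [[0.0] * (width * factor) for _ in range(height * factor)]
    for i in range(height):
        for j in range(width):
            # Each input value lands in the top-left corner of its block.
            out[i * factor][j * factor] = image[i][j]
    return out
```

In a decoder, a step like this would be followed by a $5 \times 5$ convolution that fills in the zeroed positions.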

B.2 Variational Autoencoder with slowness 

Enforcing temporal slowness during learning has previously been found to be a good proxy for
learning representations in reinforcement learning and representation learning from videos.
We also consider a VAE variant with a slowness term on the latent representation by enforcing
similarity of the encodings of temporally close images. This can be achieved by augmenting the
standard VAE objective $\mathcal{L}^{\text{bound}}$ with an additional KL divergence term on the latent posterior $Q_\phi$:

$$\mathcal{L}^{\text{slow}}(x_t, x_{t+1}) = \mathrm{KL}\left(Q_\phi(z_{t+1} \mid x_{t+1}) \,\|\, Q_\phi(z_t \mid x_t)\right). \qquad (15)$$

Indeed, there seems to be slightly better coherence of similar states in the latent spaces, as depicted
e.g. in Figure 8 in the main paper. Yet our experiments show that a slowness term alone does
not suffice to structure the latent space such that locally linear predictions and control become
feasible.
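For diagonal Gaussian posteriors the slowness term has a simple closed form. As a worked sketch, assume (notation introduced here for illustration only) that $Q_\phi(z_t \mid x_t) = \mathcal{N}(\mu_t, \mathrm{diag}(\sigma_t^2))$ with latent dimensionality $n_z$. Equation (15) then becomes:

```latex
\mathcal{L}^{\text{slow}}(x_t, x_{t+1})
  = \frac{1}{2} \sum_{i=1}^{n_z} \left(
      \frac{\sigma_{t+1,i}^2}{\sigma_{t,i}^2}
    + \frac{(\mu_{t,i} - \mu_{t+1,i})^2}{\sigma_{t,i}^2}
    - 1
    + 2 \log \sigma_{t,i}
    - 2 \log \sigma_{t+1,i}
  \right)
```

The term vanishes exactly when the two encodings share the same mean and variance, which is the sense in which it pulls temporally close images together.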

B.3 Evaluation criteria 

For comparing the performance of all variants of E2C and the baselines, the following criteria are of 
importance: 

• Autoencoding. Being able to reconstruct the given observations is the basic necessity for 
a model to work. The reconstruction cost drives a model to identify single states from its 
observations. 

• Decoding the next state. For any planning to be possible at all, the decoder must be able 
to generate the correct images from transitions the dynamics model performed. If this is 
not the case, we know that the latent states of the encoding and the transition model do not 
coincide, thus preventing any planning. 

• Optimizing latent trajectory costs. The action sequences for achieving a specified goal 
will be determined completely by locally linearized dynamics in the latent space. Therefore 
minimizing trajectory costs in latent space is, again, a necessity for successful control. 

• Optimizing real trajectory costs. While the action sequence has been determined for 
the latent dynamics, the deciding criterion is whether this reflects the true state trajectory 
costs. Therefore carrying out the “dreamed” plans in reality is the optimality criterion for
every model. To make the different models comparable, we use the same cost matrices for 
evaluation, which are not necessarily the same as for optimization. 

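We reflected these four criteria in the evaluation table in the paper. For the reconstruction of the
current and next state we report the mean log loss, which in the case of Bernoulli distributions is
the cross-entropy error function:

$$-\frac{1}{N}\sum_{n=1}^{N}\left[x_n \log p_n + (1 - x_n)\log(1 - p_n)\right], \qquad (16)$$

where $x_n$ are the target pixel values and $p_n$ the predicted Bernoulli means.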
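The reconstruction criterion can be sketched in a few lines of Python. This is the standard Bernoulli cross entropy averaged over paired pixel values; whether the average runs over pixels, samples, or both is an assumption of this sketch, and `mean_log_loss` and the clipping constant `eps` are hypothetical choices made here.

```python
import math


def mean_log_loss(targets, predictions, eps=1e-12):
    """Mean Bernoulli cross entropy over paired pixel values in [0, 1]."""
    total = 0.0
    for x, p in zip(targets, predictions):
        # Clip predictions to avoid log(0); eps is an illustrative choice.
        p = min(max(p, eps), 1.0 - eps)
        total -= x * math.log(p) + (1.0 - x) * math.log(1.0 - p)
    return total / len(targets)
```

A maximally uncertain prediction of 0.5 for every pixel gives a loss of $\log 2$ per pixel, which is a useful reference point when reading reconstruction scores.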


For the costs a model imagines and truly achieves, we sample from different starting states and 
accumulate the distances in latent and true state space according to the SOC method. 
B.4 The three-link robot arm


The robot arm we used in the last experiment in the main paper was simulated using dynamics
generated by the MapleSim simulator (http://www.maplesoft.com/products/maplesim/),
wrapped in Python and visualized for producing inputs to E2C using PyGame. We simulated a fairly
standard robot arm with three links. The lengths of the links were set to 2 m, 1.2 m and 0.7 m.
The masses of the corresponding links were all set to 10 kg.

B.5 Evaluating the true system model 

To compare the efficacy of different models when combined with optimal control algorithms, we 
always reported the cost in latent space (as used by the optimal control algorithm) as well as the 
“real” trajectory cost. To compute this real cost, we evaluated the same cost function as in the latent 
space (quadratic costs on the deviation from a given goal state), but using the real system states 
during execution and different cost matrices for a fair comparison. 

As an upper bound on the performance achievable for control by any of the models, we also
computed the true system cost by applying iLQR/AICO to a model of the real system dynamics. We
have this model available since all experiments were performed in simulation.
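A quadratic cost on the deviation from a goal state can be accumulated along a trajectory as in the following sketch. It assumes a per-step cost of the form $(s_t - s_{goal})^T R_z (s_t - s_{goal}) + u_t^T R_u u_t$; the function names and the nested-list matrix representation are choices made here for illustration.

```python
def quad_form(x, R):
    """Return x^T R x for a vector x and a square matrix R (nested lists)."""
    n = len(x)
    return sum(x[i] * R[i][j] * x[j] for i in range(n) for j in range(n))


def trajectory_cost(states, actions, goal, R_z, R_u):
    """Accumulate quadratic state and action costs along one trajectory."""
    cost = 0.0
    for s, u in zip(states, actions):
        deviation = [si - gi for si, gi in zip(s, goal)]
        cost += quad_form(deviation, R_z) + quad_form(u, R_u)
    return cost
```

Passing latent states gives the cost seen by the planner, while passing real system states with the evaluation cost matrices gives the “real” trajectory cost.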

B.6 Neural Network training 
B.6.1 Experimental Setup 

All the datasets were created in advance as $\mathcal{D} = \{(x_1, u_1, x_2), \ldots, (x_{T-1}, u_{T-1}, x_T)\}$ for the
training, validation and test split. While the E2C models were trained on $\mathcal{D}$, the ones that do not
incorporate any transition information (i.e. AE, VAE) were trained on images $\mathcal{D}_{\text{images}} = \{x_1, \ldots, x_T\}$
extracted from the original dataset $\mathcal{D}$. The slowness VAE was trained on the subset of image pairs
$\mathcal{D}_{\text{pairs}} = \{(x_1, x_2), \ldots, (x_{T-1}, x_T)\}$ and our E2C models on the full $\mathcal{D}$.

In order to learn dynamics predictions for the image-only autoencoders, we extracted
the latent representations and combined them with the actions from $\mathcal{D}$ into $\mathcal{D}_{\text{dynamics}} =
\{(z_1, u_1, z_2), \ldots, (z_{T-1}, u_{T-1}, z_T)\}$. On these low-dimensional representations we trained the
dynamics MLPs, thus ensuring that all methods were trained on exactly the same data.
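The relationship between the four datasets can be sketched as follows, assuming a single trajectory stored as a list of images and a list of actions. The function names are hypothetical, and `encode` stands in for whichever trained encoder is being evaluated.

```python
def build_datasets(images, actions):
    """Split one trajectory into transition triples, images, and image pairs.

    `images` holds x_1..x_T and `actions` holds u_1..u_{T-1}.
    """
    if len(actions) != len(images) - 1:
        raise ValueError("expected exactly one action per transition")
    triples = [(images[t], actions[t], images[t + 1])
               for t in range(len(actions))]
    d_images = list(images)
    d_pairs = [(x, x_next) for x, _, x_next in triples]
    return triples, d_images, d_pairs


def build_dynamics_dataset(triples, encode):
    """Replace images by latent codes; `encode` is a hypothetical encoder."""
    return [(encode(x), u, encode(x_next)) for x, u, x_next in triples]
```

Because every set is derived from the same triples, all methods see exactly the same underlying transitions.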

B.6.2 Implementation details 

We used orthogonal weight initialization for every layer (38). As described in the main paper,
Adam (14) was used as the learning rule for all networks. We found both these techniques to be
fundamentally important for stabilizing training and achieving good reconstructions for all methods.
Both methods also clearly helped to cut the hyperparameter search needed for all methods to a
minimum. In the process of training, we could make out three phases: the unfolding of the latent
space, the overcoming of the trivial solution (the average image of the dataset) and the minimization
of the latent KL term. The architectures used for our experiments were as follows (where ReLU
stands for rectified linear units (39) and conv. for convolutions):

Plane 

• Input: $40^2$ image dimensions, 2 action dimensions

• Latent Space dimensionality: 2

• Encoder: 150 ReLU - 150 ReLU - 150 ReLU - 4 Linear (2 for AE)

• Decoder: 200 ReLU - 200 ReLU - 1600 Linear (Sigmoid for AE)

• Dynamics: 100 ReLU - 100 ReLU + Output layer (except Global E2C)

- AE, VAE, VAE with slowness, Non-linear E2C: 2 Linear

- E2C: 8 Linear ($2 \cdot 2$ for $A_t$, $2 \cdot 1$ for $B_t$, 2 for $o_t$), $\lambda = 0.25$

• Adam: $\alpha = 10^{-4}$, $\beta_2 = 0.1$

• Evaluation costs: $R_z = 0.1 \cdot I$, $R_u = I$, $R_o = I$

Pendulum swing-up 
• Input: $2 \cdot 48^2$ image dimensions, 1 action dimension

• Latent Space dimensionality: 3

• Encoder: 800 ReLU - 800 ReLU - 6 Linear (3 for AE)

• Decoder: 800 ReLU - 800 ReLU - 4608 Linear (Sigmoid for AE)

• Dynamics: 100 ReLU - 100 ReLU + Output layer (except Global E2C)

- AE, VAE, VAE with slowness, Non-linear E2C: 3 Linear

- E2C: 12 Linear ($2 \cdot 3$ for $A_t = (I + v_t r_t^T)$, $3 \cdot 1$ for $B_t$, 3 for $b_t$), $\lambda = 0.25$

• Adam: $\alpha = 3 \cdot 10^{-4}$, $\beta_2 = 0.1$

• Evaluation costs: $R_z = I$, $R_u = 0.1 \cdot I$


Cart-Pole balancing 

• Input: $2 \cdot 80^2$ image dimensions, 1 action dimension

• Latent Space dimensionality: 8

• Encoder: 32 × 5 × 5 ReLU - 32 × 5 × 5 ReLU - 32 × 5 × 5 ReLU - 512 ReLU - 512 ReLU

• Decoder: 512 ReLU - 512 ReLU - 2 × 2 up-sampling - 32 × 5 × 5 ReLU - 2 × 2 up-sampling
- 32 × 5 × 5 ReLU - 2 × 2 up-sampling - 32 × 5 × 5 conv. ReLU

• Dynamics: 200 ReLU - 200 ReLU + 32 Linear ($2 \cdot 8$ for $A_t = (I + v_t r_t^T)$, $8 \cdot 1$ for $B_t$, 8
for $b_t$), $\lambda = 1$

• Adam: $\alpha = 10^{-4}$, $\beta_2 = 0.1$

• Evaluation costs: $R_z = I$, $R_u = I$


Three-link arm 

• Input: $2 \cdot 128^2$ image dimensions, 3 action dimensions

• Latent Space dimensionality: 8

• Encoder: 64 × 5 × 5 conv. ReLU - 2 × 2 max-pooling - 32 × 5 × 5 conv. ReLU - 2 × 2
max-pooling - 32 × 5 × 5 conv. ReLU - 2 × 2 max-pooling - 512 ReLU - 512 ReLU

• Decoder: 512 ReLU - 512 ReLU - 2 × 2 up-sampling - 32 × 5 × 5 ReLU - 2 × 2 up-sampling
- 32 × 5 × 5 ReLU - 2 × 2 up-sampling - 64 × 5 × 5 conv. ReLU

• Dynamics: 200 ReLU - 200 ReLU + 48 Linear ($2 \cdot 8$ for $A_t = (I + v_t r_t^T)$, $8 \cdot 3$ for $B_t$, 8
for $b_t$), $\lambda = 1$

• Adam: $\alpha = 10^{-4}$, $\beta_2 = 0.1$

• Evaluation costs: $R_z = I$, $R_u = 0.001 \cdot I$
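The sizes of the linear output layers listed for the pendulum, cart-pole and three-link arm follow directly from the rank-one parameterization. The sketch below reproduces that arithmetic; the function names are hypothetical, and the full-matrix count is included only to show the saving.

```python
def e2c_transition_outputs(n_z, n_u):
    """Outputs for A_t = I + v_t r_t^T (2 * n_z), B_t (n_z * n_u) and the offset (n_z)."""
    return 2 * n_z + n_z * n_u + n_z


def full_matrix_transition_outputs(n_z, n_u):
    """Same count if A_t were predicted as a full n_z x n_z matrix."""
    return n_z * n_z + n_z * n_u + n_z
```

For the three-link arm ($n_z = 8$, 3 action dimensions) this gives 48 outputs instead of the 96 a full matrix would need.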



[Figure 5 panels: True State, AE, VAE, VAE with slowness, Non-linear E2C, Global E2C, E2C]

Figure 5: Generated “dreamed” trajectories of different models for the plane task (from left to right).
The opacity of the obstacles has been lowered in this depiction for better visibility of the agent.
C Supplementary evaluations

C.1 Trajectories for plane and pendulum

To qualitatively measure the predictive accuracy, the starting state for a trajectory is encoded and the
actions are applied on the latent representation. After each transition, the predicted latent position
is decoded and visualized. In this manner, multi-step predictions can be generated for the planar
system in Figure 5 and for the inverted pendulum in Figures 6 and 7.
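The procedure amounts to a short loop. In the sketch below, `encode`, `transition` and `decode` are hypothetical callables standing in for the trained inference network, latent dynamics model and decoder; only the control flow is meant to mirror the description above.

```python
def dream_trajectory(x_start, actions, encode, transition, decode):
    """Encode once, step in latent space, and decode every predicted state."""
    z = encode(x_start)
    frames = []
    for u in actions:
        z = transition(z, u)       # prediction happens purely in latent space
        frames.append(decode(z))   # decoded only for visualization
    return frames
```

Note that after the first encoding no real image is fed back in, so errors in the latent dynamics accumulate over the rollout.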
Figure 6: Generated “dreamed” trajectories (from left to right) for passive dynamics: the pendulum
starts with angle $\theta = -\pi/2$ without velocity. The models have to predict the dynamics while no
force is applied.
Figure 7: Dreamed trajectories (from left to right) for controlled dynamics: the pendulum starts
with angle $\theta = \pi/2$ without velocity. For 6 timesteps, full force is applied to the right, followed
by 4 timesteps of full force to the left.

C.2 Inverted pendulum latent space 

Encoding the pendulum depictions into a 3-dimensional latent space allows for a visual comparison 
in Figure 8.

Figure 8: Latent spaces of all baseline models and E2C variants for the inverted pendulum.


C.3 Trajectories for cart-pole and three-link arm 

Finally, similar to the images in Section C.1, Figure 9 shows multi-step predictions for the
cart-pole system. We depict important cases: (1) a long-term prediction with the cart-pole standing
still (essentially the unstable fixed point of the underlying dynamics); (2) the cart-pole moving to the
right, changing the direction of the pole's angular velocity (middle column); and (3) the pole moving
farthest to the right. The long-term predictions by the E2C model are all of high quality. Note that
for the uncontrolled dynamics the predictions show a slight bias of the pole moving to the right (an
effect that we consistently saw in trained models for the cart-pole). We attribute this problem to the
fact that discretization errors in the image rendering process of the pole angle make it hard to predict
small velocities accurately.

C.4 Exemplary trajectory taken for three-link arm task 

Figure 10 shows a segment of a controlled trajectory for the three-link arm as executed by the E2C
system. Note that, in contrast to other figures in this supplementary material, it does not show a 
long-term prediction but rather 10 steps of a trajectory (together with one-step-ahead predictions) 
that was taken by the E2C system when combined with model predictive control. For additional 
visualizations and controlled trajectories for all tasks we refer to the supplementary video. 

C.5 Comparison of different models for cart-pole and robot arm 

In Table 2 we compare our variety of models in terms of real trajectory cost and task success
percentage on the cart-pole and the robot arm. All results are averaged over 30 different starting
states with a fixed goal state.

The cart-pole always starts in the goal state (zero angle and zero velocity) with small additive
Gaussian noise ($\sigma = 0.01$). Success is defined as preventing the pole from falling below an angle of
±0.85 rad. The three-link arm system begins in a random configuration and the goal is to unroll
all joints (i.e. make all angles zero) and stay $\epsilon$-close to that position.

The results show that only E2C and its non-linear variant can perform this task successfully, although
there is still a large performance gap between the two. We conclude that the error of linearizing
non-linear dynamics after training the corresponding model grows to the point of no longer allowing
accurate control for the system.
[Figure 9 panels, left to right: No Control, Moving right, Moving left; each panel pairs Real and Generated image columns]

Figure 9: Dreamed trajectories (top to bottom) for uncontrolled (left column) and controlled
(middle/right column) dynamics in the cart-pole system. The red image shows the initial configuration,
which is encoded resulting in $z_1$. The images in the right half of each column are then generated
without additional input by following the dynamics in latent space. The left column depicts the
uncontrolled case ($u = 0$ for all steps). The middle column shows a controlled trajectory with torque
$-20$ applied in each step and the right column a trajectory with torque $20$ applied in each step.
Prediction of the history image is omitted in these depictions.


Table 2: Comparison between trajectory costs of different approaches for the cart-pole and three-link
task. The standard Autoencoder, Variational Autoencoder and Global E2C model are omitted
from the table as they failed on this task (performance similar to VAE with slowness).

                     Cart-Pole balance               Three-link arm
Algorithm            Traj. Cost        Success %     Traj. Cost           Success %
True model           15.33 ± 7.70      100%          59.46                100%
VAE + slowness       49.12 ± 16.94     0%            1275.53 ± 864.66     0%
E2C no latent KL     48.90 ± 17.88     0%            1246.69 ± 262.6      0%
Non-linear E2C       31.96 ± 13.26     63%           460.40 ± 82.18       40%
E2C                  22.23 ± 14.89     93%           90.23 ± 47.38        90%
Table 3: Comparison between AICO and iLQR based on the “real” cost for controlling the cart-pole
and three-link robot arm using convolutional networks.

Method                  iLQR              AICO
Cart-Pole
  E2C                   14.56 ± 4.12      12.56 ± 2.47
  True model            7.45 ± 1.22       7.03 ± 1.07
Three-Link Robot Arm
  E2C                   93.78 ± 32.98     92.99 ± 20.12
  True model            53.59 ± 9.74      56.34 ± 10.82


C.6 Comparison of trajectory optimizers for cart-pole and robot arm 

To compare how well AICO deals with the covariance matrices estimated in latent space, we
performed an additional experiment on the cart-pole and three-link robot arm task comparing it to iLQR.
We performed model predictive control using the locally linear E2C model, starting in 10 different
start states each. The remaining settings are as given in Section C.5.

As reported in Table 3, both methods performed about the same for these tasks, indicating that the
covariance matrices estimated by our model do not “hurt” planning, but considering them does not
improve performance either.
Figure 10: Frames extracted from a trajectory (top to bottom) as executed by the Embed to Control
system. The left column shows the real images corresponding to transitions taken in the MDP. 
Middle and right column show the prediction of history image and current image based on the 
previous two images. 